Discussion of "Multiple Testing for 
Exploratory Research" by J. J. Goeman 
and A. Solari 
Abstract. Goeman and Solari [Statist. Sci. 26 (2011) 584-597] have 
addressed the interesting topic of multiple testing for exploratory re- 
search, and provided us with nice suggestions for exploratory analysis. 
They defined properties that an inferential procedure should have for 
exploratory analysis: the procedure should be mild, flexible and post 
hoc. Their inferential procedure gives a lower bound on the number of 
false hypotheses among the selected hypotheses, and moreover when- 
ever possible identifies elementary hypotheses that are false. The need 
to estimate a lower bound on the number of false hypotheses arises 
in various applications, and the partial conjunction approach was de- 
veloped for this purpose in Biometrics 64 (2008) 1215-1222 (see also 
Philos. Trans. R. Soc. Lond. Ser. A 367 (2009) 4255-4271 for more 
details). For example, in a combined analysis of several studies that 
examine the same problem, it is of interest to give a lower bound on 
the number of studies in which the finding was reproduced. I will first 
address the relation between the method of Goeman and Solari and the 
partial conjunction approach. Then I will discuss possible extensions 
and address the issue of exploration in more general settings, where 
the local test may not be defined in advance or where the candidate 
hypotheses may not be known to begin with. 



1. RELATION TO THE TESTING OF PARTIAL 
CONJUNCTION HYPOTHESES 

Let $H_1, \ldots, H_n$ be the elementary hypotheses. The
idea of giving a lower bound on the number of false
elementary hypotheses (or equivalently an upper
bound on the number of true elementary hypotheses)
appears in [1], and is closely related to the
tests of partial conjunction hypotheses. The partial
conjunction null hypothesis $H^{u/n}$ in [1] asks
whether fewer than $u$ of the elementary hypotheses
are false, and the alternative hypothesis is that
at least $u$ of the elementary hypotheses are false.
Testing whether $H^{u/n}$ is false at a significance level
$\alpha$ in order (i.e., for $u = 1, 2, \ldots$) results in a $1 - \alpha$
confidence lower bound on the number of false
elementary hypotheses:

Theorem 1.1. Let $p^{u/n}$ be a partial conjunction
p-value for testing $H^{u/n}$. Let
$u_{\max} = \max\{u : p^{i/n} \le \alpha \ \forall i = 1, \ldots, u\}$.
Then with $1 - \alpha$ confidence, the
true number of false hypotheses is in $[u_{\max}, n]$.

Proof. Let $k$ be the true number of false elementary
hypotheses. If $k = n$, that is, all elementary
hypotheses are false, there is nothing to prove.
If $k < n$, then $H^{(k+1)/n}$ is true, and the event
$k < u_{\max}$ implies $p^{(k+1)/n} \le \alpha$, so

$$\Pr(k \ge u_{\max}) = 1 - \Pr(k < u_{\max}) \ge 1 - \Pr(p^{(k+1)/n} \le \alpha) \ge 1 - \alpha. \qquad \Box$$
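
The sequential rule in Theorem 1.1 can be sketched in a few lines. This is only an illustration: the function name `lower_bound_false` is hypothetical, the input is assumed to already hold $p^{1/n}, \ldots, p^{n/n}$ in that order, and returning 0 when $p^{1/n} > \alpha$ is a convention assumed here rather than stated in the theorem.

```python
from typing import Sequence


def lower_bound_false(pc_pvalues: Sequence[float], alpha: float) -> int:
    """Return u_max = max{u : p^{i/n} <= alpha for all i = 1, ..., u}.

    pc_pvalues[u - 1] is assumed to hold the partial conjunction
    p-value p^{u/n}. Returns 0 if even p^{1/n} exceeds alpha
    (an assumed convention for the empty case).
    """
    u_max = 0
    for p in pc_pvalues:
        if p > alpha:
            break  # testing "in order" stops at the first non-rejection
        u_max += 1
    return u_max


# Hypothetical p-values, for illustration only.
pc = [0.001, 0.004, 0.03, 0.20, 0.01]
bound = lower_bound_false(pc, alpha=0.05)
print(f"at least {bound} of {len(pc)} elementary hypotheses are false")
```

Because the hypotheses are tested in order, a small p-value that appears after the first non-rejection (the last entry above) does not raise the bound.
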

The lower bound $u_{\max}$ above is identical to the
lower bound of Goeman and Solari (denoted by
$f_\alpha(\{1, \ldots, n\})$ in their paper), when the full set of
elementary hypotheses is considered. Moreover, the
shortcuts suggested by Goeman and Solari are equivalent
to the tests of partial conjunction hypotheses
suggested in [1], which do not require examination
of all $\binom{n}{n-u+1}$ intersection hypotheses for the test
of $H^{u/n}$, but rather require only testing the subset
of $n - u + 1$ intersection hypotheses that correspond
to the $n - u + 1$ least significant elementary hypothesis
p-values. Specifics follow.

Reference [1] suggested methods for combining the
p-values for testing $H^{u/n}$ that are based on sufficient
combining functions.

Definition 1.1. $f(U_1, \ldots, U_m)$ is a sufficient
combining function from $\Re^m \to \Re$ if it has the
following properties:

1. If $U_i' \ge U_i$, then
$f(U_1, \ldots, U_{i-1}, U_i', U_{i+1}, \ldots, U_m) \ge f(U_1, \ldots, U_{i-1}, U_i, U_{i+1}, \ldots, U_m)$,
that is, $f$ is an increasing function of its components.

2. If $U_i$ is uniformly distributed or stochastically
larger than the uniform, that is,
$U_i \ge_{st} U(0,1) \ \forall i = 1, \ldots, m$,
then $f(U_1, \ldots, U_m) \ge_{st} U(0,1)$.

Let $p_{(1)} \le \cdots \le p_{(n)}$ be the sorted p-values. The
following lemma gives the guiding principle for the
p-values suggested in [1] for testing the partial
conjunction hypothesis:

Lemma 1.1. Let $f(U_1, \ldots, U_{n-u+1})$ be a sufficient
combining function from $\Re^{n-u+1} \to \Re$. Let $p^{u/n}$
be the result of combining the largest $n - u + 1$
p-values using the function $f$, that is,
$p^{u/n} = f(p_{(u)}, \ldots, p_{(n)})$.
Then $\Pr(p^{u/n} \le \alpha) \le \alpha$ if $H^{u/n}$ is true.
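
As an illustration of Definition 1.1 and Lemma 1.1, and not a choice stated above, consider the Bonferroni-type function $f(U_1, \ldots, U_m) = \min\{1, m \min_i U_i\}$. The sketch below checks both properties and writes out the partial conjunction p-value that Lemma 1.1 would then give. No independence assumption is used in this particular check.

```latex
% Property 1: \min_i U_i is nondecreasing in each U_i, hence so is f.
% Property 2: suppose each U_i is stochastically larger than U(0,1).
% For 0 \le x < 1 (the case x = 1 is trivial), by the union bound,
\Pr\bigl(f(U_1,\ldots,U_m) \le x\bigr)
  = \Pr\Bigl(\min_{1 \le i \le m} U_i \le \tfrac{x}{m}\Bigr)
  \le \sum_{i=1}^{m} \Pr\Bigl(U_i \le \tfrac{x}{m}\Bigr)
  \le m \cdot \tfrac{x}{m} = x .
% Applying Lemma 1.1 with m = n - u + 1, and using that p_{(u)} is the
% smallest of the n - u + 1 largest p-values:
p^{u/n} = \min\bigl\{1,\; (n-u+1)\, p_{(u)}\bigr\}.
```
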

For example, if the p-values are independent, the
p-value motivated by the Fisher method for testing
$H^{u/n}$ is

$$p^{u/n} = \Pr\Bigl(\chi^2_{2(n-u+1)} \ge -2 \sum_{i=u}^{n} \log p_{(i)}\Bigr).$$

Finding $u_{\max}$ using the partial conjunction test
p-values based on Fisher's method will give the same
result as the procedure in Section 4.1 of Goeman and
Solari, when the full set of elementary hypotheses is
considered.
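
A minimal sketch of the Fisher-based partial conjunction p-values, assuming independent p-values that lie in (0, 1]. The function names are hypothetical. The chi-square tail probability is computed with the closed form for even degrees of freedom, $\Pr(\chi^2_{2m} \ge x) = e^{-x/2} \sum_{j=0}^{m-1} (x/2)^j / j!$, so that only the standard library is needed.

```python
import math
from typing import List, Sequence


def chi2_sf_even_df(x: float, half_df: int) -> float:
    """Pr(chi-square with 2 * half_df degrees of freedom >= x)."""
    if x <= 0:
        return 1.0
    term = 1.0   # (x/2)^j / j! at j = 0
    total = 1.0
    for j in range(1, half_df):
        term *= (x / 2.0) / j
        total += term
    return min(1.0, math.exp(-x / 2.0) * total)


def fisher_pc_pvalues(pvalues: Sequence[float]) -> List[float]:
    """Return [p^{1/n}, ..., p^{n/n}], combining p_(u), ..., p_(n) by Fisher's method."""
    p_sorted = sorted(pvalues)
    n = len(p_sorted)
    result = []
    for u in range(1, n + 1):
        # Fisher statistic on the n - u + 1 largest p-values.
        stat = -2.0 * sum(math.log(p) for p in p_sorted[u - 1:])
        result.append(chi2_sf_even_df(stat, n - u + 1))
    return result


# Hypothetical p-values, for illustration only.
alpha = 0.05
pc = fisher_pc_pvalues([0.001, 0.004, 0.03, 0.6])

# Lower bound of Theorem 1.1: test H^{u/n} in order, stop at first non-rejection.
u_max = 0
for p in pc:
    if p > alpha:
        break
    u_max += 1
print(pc, u_max)
```

For $u = n$ the combination involves a single p-value, so $p^{n/n}$ reduces to $p_{(n)}$, which is a quick sanity check on the implementation.
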

Similarly, if a set $R \subset \{1, \ldots, n\}$ is selected a priori,
then the lower bound on the number of false
hypotheses may be found by testing in order the
partial conjunction hypotheses

$$H^{u/|R|}, \quad u = 1, 2, \ldots,$$

where $|R|$ is the cardinality of $R$. If the set $R$ is
selected post hoc, then the lower $1 - \alpha$ confidence
bound on the number of false hypotheses may be
lower than the bound resulting from the above procedure
because of the selection effect, and the procedures
suggested by Goeman and Solari can be used
to adjust for the selection effect.
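
To make the selection effect concrete, the following is a brute-force sketch of a closed-testing bound in the spirit of Goeman and Solari, as I understand their construction: an intersection hypothesis is rejected only if every intersection hypothesis containing it is rejected by its local test, and the bound for a selected set $R$ is $|R|$ minus the size of the largest subset of $R$ that is not rejected. The Bonferroni local test and all function names are assumptions made for illustration. The enumeration is exponential in $n$, so this is suitable only for very small examples and uses none of the shortcuts discussed above.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Sequence, Set


def bonferroni_local(pvals: Sequence[float]) -> float:
    """Local test p-value for one intersection hypothesis (illustrative choice)."""
    return min(1.0, len(pvals) * min(pvals))


def closed_testing_rejections(
    pvalues: Sequence[float],
    alpha: float,
    local_test: Callable[[Sequence[float]], float] = bonferroni_local,
) -> Set[FrozenSet[int]]:
    """Index sets whose intersection hypothesis is rejected by closed testing."""
    n = len(pvalues)
    subsets = [frozenset(c) for size in range(1, n + 1)
               for c in combinations(range(n), size)]
    locally_rejected = {s for s in subsets
                        if local_test([pvalues[i] for i in s]) <= alpha}
    # Reject s only if every superset of s (including s) is locally rejected.
    return {s for s in subsets
            if all(t in locally_rejected for t in subsets if s <= t)}


def lower_bound_in_selection(selected: Iterable[int],
                             rejected: Set[FrozenSet[int]]) -> int:
    """|R| minus the size of the largest subset of R not rejected by closed testing."""
    sel = sorted(set(selected))
    largest_unrejected = 0
    for size in range(len(sel), 0, -1):
        if any(frozenset(c) not in rejected for c in combinations(sel, size)):
            largest_unrejected = size
            break
    return len(sel) - largest_unrejected


# Hypothetical p-values; R = {0, 1} is picked after seeing the data.
pvalues = [0.02, 0.03, 0.9, 0.95]
rejected = closed_testing_rejections(pvalues, alpha=0.05)
print(lower_bound_in_selection([0, 1], rejected))
```

In this example, treating $R = \{0, 1\}$ as fixed in advance and testing $H^{1/2}$ and $H^{2/2}$ with the Bonferroni-type p-values $2 p_{(1)} = 0.04$ and $p_{(2)} = 0.03$ would reject both at level 0.05, whereas the closed-testing bound over all four hypotheses is smaller, because the intersection over $\{0, 1, 2\}$ is not rejected.
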

2. MULTIPLE FAMILIES OF HYPOTHESES 
IN EXPLORATORY RESEARCH 

In [1], the partial conjunction approach was used 
to estimate the lower bound on the number of false 
hypotheses when a large number of such lower bounds 
need to be estimated simultaneously. In multiple 
testing for exploratory research, a similar problem 
may arise. Consider, for example, a large genomics 
study, where the signals in many genes (or SNPs) are 
measured simultaneously. In order to select genes (or 
SNPs) for follow-up, the researcher may want to se- 
lect a subset of promising genes from prespecified 
regions in the genome. In such a problem, in each 
region a subset of promising genes (or SNPs) may 
be selected by exploration of that region. 

When exploring multiple families of hypotheses, 
in order to limit the total number of false leads, the 
decision about the subset of hypotheses selected for 
follow-up in each family may be affected by the esti- 
mated lower bounds on the number of false null hy- 
potheses in the subsets selected in other families of 
hypotheses. Moreover, the researcher may be inter- 
ested in a lower bound on the number of false leads 
at the level of families rather than at the level of el- 
ementary null hypotheses. These are natural exten- 
sions to the problem addressed by Goeman and So- 
lari, where multiple testing may be applied to multi- 
ple families of hypotheses in an exploratory manner. 

3. THE CHOICE OF THE LOCAL TEST 

The approach of Goeman and Solari assumes that 
the test of each intersection hypothesis is known 
in advance. However, it may be difficult to decide 
which local test is best without first looking at the 
data. 

In some applications, we may not always have 
a good statistic in mind for evaluating an elemen- 
tary null hypothesis. We may need to explore the 
data in order to decide on a good test statistic for 
testing the null hypothesis. However, when testing 
the elementary hypothesis on the data explored to 



DISCUSSION 



3 



decide on the test, the test is no longer a valid test 
in the sense that there is no guarantee it preserves 
the level of the test. 

Moreover, when we have several elementary hy- 
potheses of interest and we want to test their inter- 
section hypothesis, how should the test statistic be 
chosen? Different tests will have power against dif- 
ferent alternatives. Even if we limit ourselves to tests 
that are based on combining functions of the ele- 
mentary hypotheses p-values, different functions are 
better capable of detecting different patterns of ev- 
idence against the intersection null hypothesis, and 
the differences among them can be large (see, e.g., [7] 
and [4]). Because no single combining function can 
be best under all circumstances, in exploratory anal- 
ysis the researcher may choose a combining function 
by exploring different combining methods. The cho- 
sen method may then be used on data from follow- 
up studies. However, for testing the intersection hy- 
potheses on the data explored, the test is no longer 
a valid test. 
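
The dependence of power on the pattern of evidence can be illustrated with two combining functions applied to the same intersection hypothesis. This is a sketch under assumptions: the p-values are hypothetical, Fisher's method is used as if the p-values were independent, and the Bonferroni-type combination is included only as a contrasting choice. The function names are illustrative.

```python
import math
from typing import Sequence


def fisher_combined(pvals: Sequence[float]) -> float:
    """Fisher combination, assuming independent p-values in (0, 1]."""
    m = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # Fisher statistic / 2
    # Closed-form chi-square tail for 2m degrees of freedom.
    term, total = 1.0, 1.0
    for j in range(1, m):
        term *= half_stat / j
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def bonferroni_combined(pvals: Sequence[float]) -> float:
    """Bonferroni-type combination, valid under any dependence."""
    return min(1.0, len(pvals) * min(pvals))


# Weak evidence spread over all ten hypotheses.
diffuse = [0.04] * 10
# Strong evidence in a single hypothesis, none in the rest.
sparse = [0.001] + [0.9] * 9

for name, pvals in [("diffuse", diffuse), ("sparse", sparse)]:
    print(name, fisher_combined(pvals), bonferroni_combined(pvals))
```

At level 0.05, Fisher's combination rejects the intersection hypothesis for the diffuse pattern but not for the sparse one, and the Bonferroni-type combination does the reverse. Picking whichever function rejects after looking at both results is exactly the kind of exploration that invalidates the test on the same data.
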

Therefore, if the data are explored to select which 
local test to use, the confidence sets may no longer 
have the correct level and may be misleading. Nev- 
ertheless, the use of multiple testing for selecting 
hypotheses for follow-up is still a valuable tool, 
even though it is not possible to quantify the num- 
ber of false leads in the selected subset of hypotheses 
for follow-up. 

4. THE PRACTICE OF EXPLORATORY 
RESEARCH 

Even when multiple comparisons issues are addressed,
studies are still too often not reproducible
(see [6]) and scientists follow too many false leads.
This may be because, together with advances in multiple
comparisons over the years, there have been
many advances in how data can be explored. The
multiple comparisons correction may therefore be
applied, unintentionally, to only a subset of the
hypotheses that were actually examined. From
sophisticated (and even simple) graphical displays,
a hypothesis may be generated. But how can one
then quantify how many potential hypotheses have
actually been tested before selecting the particularly
interesting one based on the picture? If the user cannot
quantify how many hypotheses may be looked
at in the exploratory stage, how should the data be
analyzed to select promising hypotheses to follow
up on while still quantifying the error in terms of
a lower bound on the number of false null hypotheses?

One possibility is to define the hypotheses on part 
of the data by creative exploratory analysis and then 
apply the multiple testing procedure on the rest 
of the data (see [3]). The problem is that by test- 
ing only part of the data we lose power. Therefore, 
a modest change in current practice may be the fol- 
lowing: to set aside only the amount of data that 
the investigator is willing to spare for the purpose 
of generation of hypotheses and in order to decide 
what local test to use for each hypothesis. So, for ex- 
ample, from a study of 500 subjects the investigator 
may be willing to set aside 100 subjects, and from 
a sample size of 100 perhaps only 15 subjects may
be set aside for hypothesis generation. Once the
hypotheses and tests of hypotheses have been decided
upon, the procedure of Goeman and Solari may be 
applied. This process is mild, flexible and post hoc 
without losing all ability to quantify the confidence 
on the estimated number of false positives among 
the selected hypotheses. 
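
The set-aside step can be sketched as a random split of subject identifiers. The function name `split_subjects`, the use of a fixed seed, and the choice of a simple random split are assumptions for illustration. The text above does not prescribe how the set-aside subjects are chosen.

```python
import random
from typing import List, Sequence, Tuple


def split_subjects(subject_ids: Sequence[int],
                   n_explore: int,
                   seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly set aside n_explore subjects for hypothesis generation.

    Returns (explore, confirm). Hypotheses and local tests are chosen
    using only `explore`; the multiple testing procedure is then applied
    to `confirm` alone.
    """
    if not 0 <= n_explore <= len(subject_ids):
        raise ValueError("n_explore must be between 0 and the number of subjects")
    shuffled = list(subject_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[:n_explore], shuffled[n_explore:]


# The 500-subject example from the text: 100 set aside, 400 kept for testing.
explore, confirm = split_subjects(range(500), n_explore=100)
print(len(explore), len(confirm))
```

Fixing the split before any data are examined, for instance by recording the seed, keeps the exploratory and confirmatory parts disjoint, which is what allows the confidence statement on the held-out part to remain valid.
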